DATAX121-23A (HAM) & (SEC) - Introduction to Statistical Methods
Dungeness crabs are commercially fished between December and June along the Pacific coast of North America. Previously only male crabs were fished, which affected the population’s viability. The fisheries consulted with biologists to set the regulations for fishing female crabs.
One study to help inform these regulations investigated whether a female crab’s postmolt size is a good predictor of its premolt size. This matters because the size of a crab’s carapace is often used as a proxy for age.
| Variable | Description |
|---|---|
| Premolt | A number denoting the size of the carapace before molting (in millimetres) |
| Postmolt | A number denoting the size of the carapace after molting (in millimetres) |
A scatter plot helps us describe the direction (positive or negative) and type of relationship (linear, non-linear, or “none”)
Response variable (Dependent variable)
Explanatory variable (Independent variable)
Recall the straight line equation
\[ y_i = mx_i + c \]
where \(m\) is the slope of the line and \(c\) is the y-intercept.
\[ y_i = \beta_0 + \beta_1 \times x_i + \varepsilon_i, ~ \text{where} ~ \varepsilon_i \sim \text{Normal}(0, \sigma_\varepsilon) \]
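where \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\varepsilon_i\) is the random error for the \(i\)-th observation, whose variability is governed by \(\sigma_\varepsilon\).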
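To make the model concrete, the sketch below simulates responses from it using only Python's standard library. The function name `simulate_simple_linear`, the parameter values, and the postmolt sizes are all hypothetical, chosen purely for illustration; they are not estimates from the crab study. It also assumes \(\sigma_\varepsilon\) denotes a standard deviation.

```python
import random


def simulate_simple_linear(x_values, beta0, beta1, sigma, seed=None):
    """Simulate y_i = beta0 + beta1 * x_i + eps_i, with eps_i ~ Normal(0, sigma).

    sigma is treated as a standard deviation (an assumption about the notation).
    """
    rng = random.Random(seed)
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in x_values]


# Hypothetical explanatory values and parameters, for illustration only.
postmolt = [130.0, 135.0, 140.0, 145.0, 150.0, 155.0, 160.0]
premolt = simulate_simple_linear(postmolt, beta0=-25.0, beta1=1.07, sigma=2.0, seed=1)

for x, y in zip(postmolt, premolt):
    print(f"postmolt = {x:6.1f} mm, simulated premolt = {y:6.1f} mm")
```

Setting `sigma=0.0` removes the error term, so every simulated point falls exactly on the straight line.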
Is there a natural best guess for \(\beta_0\), \(\beta_1\), and \(\sigma_\varepsilon\) based on the data?
What we could do instead is “fit” a line to the data, then use an appropriate stopping criterion that tells us when the best-fit line has been achieved
The stopping criterion used for regression models involves minimising the “variability” of the residuals, that is, the sum of squares for residuals, \(SSR\)
\[ \DeclareMathOperator*{\argminA}{arg\,min} \begin{aligned} \argminA_{\beta_0,\,\beta_1} SSR, ~ \text{where} ~ SSR &= \sum^n_{i=1}(\varepsilon_i)^2 = \sum^n_{i=1}\{y_i - (\beta_0 + \beta_1\times x_i)\}^2 \end{aligned} \]
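As a rough illustration of this minimisation, the sketch below computes \(SSR\) for a candidate line and finds the minimising intercept and slope with the standard closed-form least-squares formulas. The function names `ssr` and `least_squares` and the toy data are hypothetical; the data are made up and are not the crab measurements.

```python
def ssr(beta0, beta1, x, y):
    """Sum of squared residuals for the candidate line beta0 + beta1 * x."""
    return sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))


def least_squares(x, y):
    """Closed-form intercept and slope that minimise SSR for a straight line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


# Made-up toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares(x, y)
best = ssr(b0, b1, x, y)

# Nudging either coefficient away from the estimates cannot lower SSR.
for d0, d1 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    assert ssr(b0 + d0, b1 + d1, x, y) >= best

print(f"intercept = {b0:.3f}, slope = {b1:.3f}, SSR = {best:.4f}")
```

Statistical software performs this minimisation for us; the point here is only that every other line has an \(SSR\) at least as large.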
\[ \widehat{y}_i = \beta_0 + \beta_1 \times x_i \]
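where \(\widehat{y}_i\) is the fitted value for the \(i\)-th observation, computed with the values of \(\beta_0\) and \(\beta_1\) that minimise \(SSR\).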
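A small sketch of how fitted values and residuals relate is given below. The helper names `fitted_values` and `residuals` are hypothetical, the toy data are made up, and the coefficients 0.05 and 1.99 are the least-squares values for this toy data, worked out by hand with the usual closed-form formulas.

```python
def fitted_values(beta0, beta1, x):
    """Fitted value for each x_i on the line beta0 + beta1 * x_i."""
    return [beta0 + beta1 * xi for xi in x]


def residuals(y, y_hat):
    """Observed response minus fitted value, for each observation."""
    return [yi - fi for yi, fi in zip(y, y_hat)]


# Made-up toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Least-squares intercept and slope for the toy data above.
b0_hat, b1_hat = 0.05, 1.99

y_hat = fitted_values(b0_hat, b1_hat, x)
res = residuals(y, y_hat)

for xi, yi, fi, ri in zip(x, y, y_hat, res):
    print(f"x = {xi:.1f}, y = {yi:.2f}, fitted = {fi:.2f}, residual = {ri:+.2f}")
```

For a least-squares line with an intercept, the residuals sum to zero up to floating-point rounding, which is a quick sanity check on the fit.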
More on 2.
For the best-fit line, that is, a simple linear regression model, this implies that there is a linear association between the response and explanatory variable
So for the best-fit line only, the L.I.N.E. acronym is a convenient way of recalling what the assumptions are